
Tool response quality, turn tracing, agent eval CLI — release 0.7.0 #6

Merged: morganlinton merged 6 commits into main from fix/tool-response-quality on Jun 10, 2026
@morganlinton (Contributor)

Summary

  • fix: tool response quality for three model-facing edge cases (file_read offset past EOF, list_dir truncation total, grep unparseable lines)
  • refactor: split the 3,000-line commands/mod.rs into config_cmds, context_cmds, memory, session
  • feat: turn tracing — /trace view, .sessions/<id>.events.jsonl event log, turn timing footer
  • feat: agent eval CLI (--eval) with mock-SSE integration tests and an optional nightly CI job
  • chore: release 0.7.0 (Cargo version, CHANGELOG, README badge)

Each commit compiles and passes cargo test, clippy, and fmt independently.

Merging with a merge commit (not squash) to keep the per-feature commits; v0.7.0 will be tagged on the release commit after merge, which triggers the release workflow and the Homebrew tap update.

🤖 Generated with Claude Code

morganlinton and others added 6 commits June 8, 2026 06:46
…t convention

Models trained on Claude Code's Edit tool send file_edit({old_text: "",
new_text: <content>}) to create new files. SmallHarness's file_edit
previously returned "File not found" immediately, forcing a retry loop
(mkdir → touch → file_edit) that wasted 2-3 extra API round-trips.

Now: a single edit with old_text="" on a missing file creates the file
(including parent dirs), matching the Claude Code convention exactly.
Non-creation cases that hit "not found" or "old_text is empty" now also
include "Use file_write to create new files" for faster recovery.

Adds two tests: one for the new creation path, one confirming the
non-empty-old_text-on-missing-file error still fires with the hint.
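
The creation path described above could look roughly like the sketch below. It is not SmallHarness's actual code: the function signature, return type, and success strings are assumptions for illustration, and only the hint text and the two error phrases come from this commit message.

```rust
use std::fs;
use std::path::Path;

/// Hint appended to recoverable errors (wording taken from the commit message).
const HINT: &str = "Use file_write to create new files";

/// Hypothetical, simplified stand-in for the file_edit tool.
pub fn file_edit(path: &Path, old_text: &str, new_text: &str) -> Result<String, String> {
    if !path.exists() {
        if old_text.is_empty() {
            // Creation convention: empty old_text on a missing file creates it,
            // including any missing parent directories.
            if let Some(parent) = path.parent() {
                fs::create_dir_all(parent).map_err(|e| e.to_string())?;
            }
            fs::write(path, new_text).map_err(|e| e.to_string())?;
            return Ok(format!("created {}", path.display()));
        }
        // Non-empty old_text on a missing file is still an error, now with a hint.
        return Err(format!("File not found: {}. {}", path.display(), HINT));
    }

    if old_text.is_empty() {
        // The file exists, so an empty old_text has nothing to anchor on.
        return Err(format!("old_text is empty. {}", HINT));
    }

    let content = fs::read_to_string(path).map_err(|e| e.to_string())?;
    if !content.contains(old_text) {
        return Err("old_text not found in file".to_string());
    }
    // Replace only the first occurrence (an assumption of this sketch).
    let updated = content.replacen(old_text, new_text, 1);
    fs::write(path, updated).map_err(|e| e.to_string())?;
    Ok(format!("edited {}", path.display()))
}
```

With this shape, a model that sends an empty old_text for a path that does not exist gets the file in one call instead of the mkdir, touch, file_edit retry loop.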

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
file_read: offset past EOF now returns a clear error instead of silently
returning empty content, which caused models to think files were empty
and retry with different offsets.
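
As a rough illustration of the file_read change, the sketch below rejects an offset past EOF instead of returning empty content. It assumes a 0-based, line-oriented offset and a hypothetical `read_lines` helper; the real tool's parameters and error wording may differ.

```rust
/// Hypothetical helper: return up to `limit` lines starting at line `offset`
/// (0-based). An offset past the end is an explicit error, not empty output.
pub fn read_lines(content: &str, offset: usize, limit: usize) -> Result<String, String> {
    let total = content.lines().count();
    // Offset 0 on an empty file is allowed and yields empty content.
    if offset > 0 && offset >= total {
        return Err(format!(
            "offset {} is past end of file ({} lines)",
            offset, total
        ));
    }
    Ok(content
        .lines()
        .skip(offset)
        .take(limit)
        .collect::<Vec<_>>()
        .join("\n"))
}
```

Including the real line count in the error gives the model enough information to pick a valid offset on the next call.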

list_dir: add "total" field to every response so models know the real
directory size when truncated (count capped at 500 but total reflects
the actual entry count).
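
A minimal sketch of the count/total distinction follows. The struct, the `truncated` flag, and the sorting step are assumptions for illustration; only the 500-entry cap and the `total` field come from this commit message.

```rust
/// Cap on returned entries (500, per the commit message).
pub const MAX_ENTRIES: usize = 500;

/// Hypothetical response shape for a list_dir-style tool.
pub struct DirListing {
    pub entries: Vec<String>,
    /// Number of entries actually returned (at most MAX_ENTRIES).
    pub count: usize,
    /// Real number of entries in the directory, even when truncated.
    pub total: usize,
    /// Convenience flag; an assumption of this sketch.
    pub truncated: bool,
}

pub fn summarize(mut names: Vec<String>) -> DirListing {
    // Record the real size before truncating.
    let total = names.len();
    names.sort();
    names.truncate(MAX_ENTRIES);
    let count = names.len();
    DirListing {
        entries: names,
        count,
        total,
        truncated: total > count,
    }
}
```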

grep: switch map → filter_map so unparseable rg output lines (e.g.
binary-file notices) are dropped rather than emitted as malformed
{content: "..."} objects missing the file and line fields. Also moves
.take(100) after filter_map to ensure up to 100 *parseable* matches.
Adds the first test module for grep.rs (6 new tests total across the
three files).
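
The map to filter_map switch, and the reason .take(100) has to come after it, can be sketched as follows. The `file:line:content` format and the `Match` type are assumptions of this sketch (paths containing a colon would need more careful parsing); it is not the code in grep.rs.

```rust
#[derive(Debug, PartialEq)]
pub struct Match {
    pub file: String,
    pub line: u32,
    pub content: String,
}

/// Parse one `file:line:content` output line; anything else yields None.
fn parse_line(raw: &str) -> Option<Match> {
    let mut parts = raw.splitn(3, ':');
    let file = parts.next()?;
    let line = parts.next()?.trim().parse().ok()?;
    let content = parts.next()?;
    if file.is_empty() {
        return None;
    }
    Some(Match {
        file: file.to_string(),
        line,
        content: content.to_string(),
    })
}

pub fn parse_output(output: &str) -> Vec<Match> {
    output
        .lines()
        // Unparseable lines (e.g. binary-file notices) are dropped here...
        .filter_map(parse_line)
        // ...so the cap counts parseable matches only.
        .take(100)
        .collect()
}
```

With .take(100) placed before the filter, 100 raw lines that include notices would yield fewer than 100 matches even when more were available.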

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…y, session

commands/mod.rs had grown past 3,000 lines. Move the command handlers into
four focused submodules — config_cmds (/config, /backend, /model, /verbose…),
context_cmds (/context, /compact, /reset, /checkpoints), memory (/index,
/map, /memory, /remember, /forget), and session (/new, /undo, /session,
/resume, /export, /path) — leaving dispatch and the command list in mod.rs.
No behavior change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add turn_trace: every turn appends structured events (tool calls with
redacted args, approvals, compaction, warmup, timing) to a sidecar at
.sessions/<session-id>.events.jsonl, enabled by default via
display.eventLog.enabled. API keys and sensitive object keys are redacted
before anything is written.
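
To make the redact-then-append order concrete, here is a std-only sketch. The event schema, the sensitive-key list, and all function names are hypothetical; the actual turn_trace module presumably uses a JSON library and its own redaction rules.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Assumed list of sensitive key fragments (illustrative only).
const SENSITIVE_KEYS: [&str; 4] = ["api_key", "token", "password", "authorization"];

/// Minimal JSON string escaping, enough for this sketch.
fn json_string(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Replace the value of any argument whose key looks sensitive.
pub fn redact_args(args: &[(&str, &str)]) -> Vec<(String, String)> {
    args.iter()
        .map(|(key, value)| {
            let lower = key.to_ascii_lowercase();
            let sensitive = SENSITIVE_KEYS.iter().any(|s| lower.contains(*s));
            let shown = if sensitive { "[REDACTED]" } else { *value };
            (key.to_string(), shown.to_string())
        })
        .collect()
}

/// Build one JSONL line for a tool call, redacting before serialization.
pub fn tool_call_event(tool: &str, args: &[(&str, &str)]) -> String {
    let fields: Vec<String> = redact_args(args)
        .iter()
        .map(|(k, v)| format!("{}:{}", json_string(k), json_string(v)))
        .collect();
    format!(
        "{{\"type\":\"tool_call\",\"tool\":{},\"args\":{{{}}}}}",
        json_string(tool),
        fields.join(",")
    )
}

/// Append one event line to the sidecar file, creating it if needed.
pub fn append_event(path: &Path, line: &str) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(file, "{}", line)
}
```

Redacting while building the line, rather than scrubbing the file afterwards, means a secret never reaches disk even if the process dies mid-turn.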

/trace on|off surfaces nested subagent/critic tool calls as indented lines
in the TUI — previously their activity was invisible (events swallowed) —
without flooding the parent context. Tool calls now carry a depth field, and
the subagent/evaluator tools forward their inner events when tracing is on.
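
One way to picture the depth field is the small sketch below. The `ToolCallEvent` type and the two-space indent are assumptions; the real TUI rendering is not shown in this PR text.

```rust
/// Hypothetical event carrying the nesting depth of a tool call:
/// 0 for the parent agent, 1+ for subagent/critic calls.
pub struct ToolCallEvent {
    pub tool: String,
    pub depth: usize,
}

/// Render events as lines, indenting nested calls. With tracing off,
/// only top-level calls are shown, so nested activity stays hidden.
pub fn render_trace(events: &[ToolCallEvent], trace_on: bool) -> Vec<String> {
    events
        .iter()
        .filter(|e| trace_on || e.depth == 0)
        .map(|e| format!("{}{}", "  ".repeat(e.depth), e.tool))
        .collect()
}
```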

The end-of-turn status line gains a timing breakdown (TTFT, model, tools,
approval, total), and the loader shows which tool is running. Compaction of
oversized tool output is now reported to the user with the original size,
and a context pressure notice is printed as the prompt budget nears the
model's effective limit.

/export <session> events copies the event log (/export current events
copies the active session's sidecar); /new and /resume reset it to the
active session.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…CI job

small-harness --eval <fixture> [--model M] [--json] runs a bundled agent
eval fixture from the shell and exits 0 on pass / 1 on fail, so evals can
gate CI. A new optional macOS CI job runs two fixtures against Ollama
nightly or when a commit message contains [eval]; it is continue-on-error
so a flaky local model never blocks merges.
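
The flag shape and exit-code contract above could be handled along these lines. This is a hand-rolled sketch with hypothetical names (`EvalArgs`, `parse_eval_args`, `exit_status`); the actual CLI may well use an argument-parsing crate and accept other flags.

```rust
#[derive(Debug, PartialEq)]
pub struct EvalArgs {
    pub fixture: String,
    pub model: Option<String>,
    pub json: bool,
}

/// Parse `--eval <fixture> [--model M] [--json]` from an argument list
/// (program name already stripped).
pub fn parse_eval_args<I>(args: I) -> Result<EvalArgs, String>
where
    I: IntoIterator<Item = String>,
{
    let mut it = args.into_iter();
    let mut fixture: Option<String> = None;
    let mut model: Option<String> = None;
    let mut json = false;

    while let Some(arg) = it.next() {
        match arg.as_str() {
            "--eval" => fixture = Some(it.next().ok_or("--eval requires a fixture")?),
            "--model" => model = Some(it.next().ok_or("--model requires a value")?),
            "--json" => json = true,
            other => return Err(format!("unexpected argument: {}", other)),
        }
    }

    let fixture = fixture.ok_or("missing --eval <fixture>")?;
    Ok(EvalArgs { fixture, model, json })
}

/// Exit status contract from the commit message: 0 on pass, 1 on fail.
pub fn exit_status(passed: bool) -> u8 {
    if passed {
        0
    } else {
        1
    }
}
```

A `main` returning `std::process::ExitCode::from(exit_status(passed))` would let a CI step fail on a non-zero status without parsing any output.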

Add agent_integration_test: drives the real agent loop against a mock
OpenAI-compatible SSE server (no live LLM) covering a tool-call round trip
plus eval checks, and the hit_step_limit cutoff flag.

Two fixes surfaced while wiring this up. First, the rubric heading parser
now matches "(weight:" case-insensitively on the raw bytes instead of
reusing byte offsets found in a lowercased copy (lowercasing can change the
byte length of some Unicode characters, so those offsets can diverge from
the original string). Second, the HTTP client gets a 10s connect timeout so
a dead backend fails fast instead of hanging, without capping long
streaming completions.
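
The rubric parser fix can be illustrated with a byte-level, ASCII case-insensitive search. The function names and the heading format beyond the "(weight:" marker are assumptions of this sketch, not the parser's real code.

```rust
/// Find `needle` in `haystack`, ignoring ASCII case, and return the byte
/// offset in the ORIGINAL string. No lowercased copy is made, so the offset
/// is always valid for slicing `haystack`.
pub fn find_ascii_ci(haystack: &str, needle: &str) -> Option<usize> {
    let h = haystack.as_bytes();
    let n = needle.as_bytes();
    if n.is_empty() || n.len() > h.len() {
        return None;
    }
    h.windows(n.len()).position(|w| w.eq_ignore_ascii_case(n))
}

/// Hypothetical heading shape: `Title (weight: 2.5)`.
/// Returns the trimmed title and the parsed weight.
pub fn parse_weight(heading: &str) -> Option<(&str, f64)> {
    const MARKER: &str = "(weight:";
    let start = find_ascii_ci(heading, MARKER)?;
    // The marker is pure ASCII, so both slice points are char boundaries.
    let rest = &heading[start + MARKER.len()..];
    let end = rest.find(')')?;
    let weight = rest[..end].trim().parse().ok()?;
    Some((heading[..start].trim(), weight))
}
```

A heading that starts with a character such as U+0130, whose lowercase form occupies more bytes than the original, is the kind of input where an offset taken from a lowercased copy no longer lines up with the original string.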

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@morganlinton morganlinton merged commit 57a6175 into main Jun 10, 2026
4 checks passed